Probabilistic Methods for Structured Document Classification at INEX'07

Authors

Luis M. de Campos

Juan M. Fernández-Luna

Juan F. Huete

Alfonso E. Romero

Abstract

This paper presents the results of our participation in the Document Mining track at INEX’07. We have focused on the task of classifying XML documents. Our approach to dealing with structured document representations applies classification methods for plain text to flattened versions of the documents, in which some of their structural properties have been translated into plain text. We have explored several options for converting structured documents into flat documents, in combination with two probabilistic methods for text categorization. The main conclusion of our experiments is that taking advantage of document structure to improve classification results is a difficult task.
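As a rough illustration of what "flattening" a structured document can mean, the sketch below shows two possible conversions of an XML document into plain text: one that discards the markup entirely, and one that prefixes each word with the name of its enclosing element so that some structural information survives as ordinary tokens. The function names, the sample document, and the `tag_word` token format are assumptions made for this example; the abstract does not specify which conversions the authors actually used.

```python
import xml.etree.ElementTree as ET


def flatten_text_only(xml_string: str) -> str:
    """Drop all markup and keep only the text content, in document order."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment; split/join normalizes whitespace.
    return " ".join(" ".join(root.itertext()).split())


def flatten_tagged(xml_string: str) -> str:
    """Prefix every word with the name of its enclosing element (hypothetical format)."""
    root = ET.fromstring(xml_string)
    tokens = []

    def visit(element):
        # Words directly inside this element are labeled with its tag.
        for word in (element.text or "").split():
            tokens.append(f"{element.tag}_{word}")
        for child in element:
            visit(child)
            # Text after a child's closing tag still belongs to the parent.
            for word in (child.tail or "").split():
                tokens.append(f"{element.tag}_{word}")

    visit(root)
    return " ".join(tokens)


# Hypothetical sample document, not taken from the INEX collection.
doc = "<article><title>XML mining</title><body>Flat text <b>works</b> well</body></article>"
print(flatten_text_only(doc))  # XML mining Flat text works well
print(flatten_tagged(doc))     # title_XML title_mining body_Flat body_text b_works body_well
```

Either output can then be fed to any plain-text classifier; the second variant simply enlarges the vocabulary so that the same word in different elements becomes a different feature.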

Similar resources

Probabilistic Methods for Link-Based Classification at INEX 2008

In this paper we propose a new method for link-based classification using Bayesian networks. It can be used in combination with any content-only probabilistic classifier, so it can be useful in combination with several different classifiers. We also report the results obtained from its application to the XML Document Mining Track of INEX’08.

Link-Based Text Classification Using Bayesian Networks

In this paper we propose a new methodology for link-based document classification based on probabilistic classifiers and Bayesian networks. We also report the results obtained from its application to the XML Document Mining Track of INEX’09.

INEX 2005 Multimedia Track

This paper reports on the activities of the INEX 2005 Multimedia track. The track was successful in realizing its objective to provide a pilot evaluation platform for the evaluation of retrieval strategies for XML-based multimedia documents. In this first exploratory year the focus of the evaluation experiment was to test approaches for the retrieval of XML fragments using a combination of cont...

Cheshire II at INEX: Using a Hybrid Logistic Regression and Boolean Model for XML Retrieval

This paper describes the retrieval approach that Berkeley used in the INEX evaluation. The primary approach is the combination of a probabilistic method using a logistic regression algorithm for estimation of collection relevance and element relevance, along with Boolean constraints. The paper also discusses our approach to XML component retrieval and how component and document retrieval are c...

Cheshire II at INEX ’03: Component and Algorithm Fusion for XML Retrieval

This paper describes the retrieval approach that UC Berkeley used in the 2003 INEX evaluation. As in last year’s INEX, our primary approach is the combination of a probabilistic method using a logistic regression algorithm for estimation of document (article) relevance and/or element relevance, along with Boolean constraints. This year we also used data fusion techniques to combine results fro...

Publication date: 2007
